The enormous size of the web and the vagueness of the terms used to 
formulate queries still pose a huge problem in achieving user satisfaction. To 
solve this problem, queries need to be disambiguated based on their context. 
One well-known technique for enhancing the effectiveness of information 
retrieval (IR) is query expansion (QE). It reformulates the initial query by 
adding similar terms that help in retrieving more relevant results. In this 
paper, we propose a new QE semantic approach based on the modified 
Concept2vec model using linked data. The novelty of our work is the use of 
query-dependent linked data from DBpedia as training data for the 
Concept2vec skip-gram model. We considered only the top feedback 
documents, and we did not use them directly to generate embeddings; we 
used their interlinked data instead. Also, we used the linked data attributes 
that have a long value, e.g., “dbo: abstract”, as training data for neural 
network models, and we extracted from them the valuable concepts for QE. 


Our experiments on the Associated Press collection dataset showed that 
retrieval effectiveness can be much improved when a skip-gram model is 
used along with a DBpedia feature. Also, we demonstrated significant 
improvements compared to other approaches. 
1. INTRODUCTION 

The explosion and diversity of data on the Web have made information more available but also more 
difficult to use and less relevant to the user. In addition, information retrieval systems (IRS) return a high 
number of unrelated results because of imprecise queries and the query optimization issue [1]. To express 
the user’s intention more clearly, a query reformulation step is often necessary. Reformulation can be 
done by expanding queries, i.e., enriching them with new terms extracted by one of several possible 
term selection methods. Expanding the original query can address both the insufficiency of information in 
the user’s query and the vocabulary mismatch between the query terms and the documents’ terms [2]. 

Query expansion was first proposed in 1960 by Maron and Kuhns [3]. It can be divided into: 
i) global approaches, which are query-independent because all documents are analyzed for all queries [4], 
and ii) local approaches, which find expansion words that are closely related to the original query words [5]. 
Recently, linked open data (LOD) knowledge bases have been used to expand queries by taking the context 
into consideration. However, the main challenge is the lack of domain-specific LOD sources, since 
domain-independent sources do not cover a high number of technical terms. For instance, DBpedia Spotlight 
sometimes fails to annotate specialized entities that it does not cover, e.g., “operating system” (OS). Also, 
every specialized area has its own vocabulary items (e.g., specific features) and characteristics that need to 
be taken into consideration [6]. In addition, many LOD attributes have a whole list of values [7], and given 
the size of such lists it can be difficult to determine the adequate value to use for expansion. 

Word embedding is also applied to achieve semantic relatedness in IR. It allows predicting 
adjacent terms for a certain word or context by capturing term proximity and similarity [8]. One problem of 
embedding models is that the quality of the generated embeddings varies, especially since metrics for 
evaluating that quality are lacking. As for ontological concepts, embeddings of the entities dbr: Mosco 
(an entity of type DBpedia PrivateCompany), dbr: Paris, and dbr: Dublin are supposedly close to the concept 
dbo: City (an entity of type DBpedia ObjectProperty) and far from entities such as dbr: Barack_Obama and 
dbr: Bill_Clinton, which are associated with the concept dbo: President. 

To resolve this issue, we propose a new query expansion method that relies on DBpedia concepts to 
generate the vectors. Our approach extends our previous work on association-based query expansion [9]. 
Instead of using cosine similarity only at the end of the embedding process as an evaluation metric of the 
embeddings’ quality, as in [10], we use it before the embedding to ensure that the training data contain only 
semantically related concepts from the start. Moreover, our approach aims at reducing the number of 
expansion concepts. It also aims at expanding the remaining expansion concepts with their interlinked data 
using the same modified Concept2Vec model and a different DBpedia attribute. 

Our paper is divided into 6 sections. Section 2 presents the preliminaries. Section 3 describes the 
related work. Section 4 details the proposed approaches. Section 5 presents the experimental results. Finally, 
section 6 concludes this paper. 


2. PRELIMINARIES 

This section presents the necessary background for understanding our proposal. The first element is 
the term frequency-inverse document frequency (TF-IDF) technique. The second element is the popular word 
embedding technique called Word2Vec. 


2.1. Term frequency-inverse document frequency 

TF-IDF is a popular statistical measure, which is widely used in text mining. It aims to weight the 
importance of each word in a particular document [11], [12]. Term frequency tf (t, d) stands for the number 
of times the term t appears in the document d. Inverse document frequency idf(t) depends mainly on the total 
number of documents N in the corpus C and on the document frequency df(t), which is the number of 
documents that contain the term t. The term idf(t) is expressed as in (1), 

$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)} \qquad (1)$$


TF-IDF weight combines both the term frequency (TF) and the inverse document frequency (IDF) to 
estimate the weight for each term t in the document d as in (2), 


$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t) \qquad (2)$$
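
As an illustration of (1) and (2), the following minimal Python sketch computes TF-IDF weights for a toy corpus. The corpus, the function name `tf_idf`, and the use of the natural logarithm are assumptions made for this example; the paper does not specify the logarithm base or the implementation it used.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document, following (1) and (2)."""
    n_docs = len(corpus)
    # df(t): number of documents that contain the term t at least once.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)  # tf(t, d): raw count of t in d
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy corpus (hypothetical, already tokenized).
docs = [
    ["computer", "virus", "outbreak"],
    ["computer", "software"],
    ["virus", "virus", "malware"],
]
w = tf_idf(docs)
print(w[2])
```

A term that occurs in every document would receive a weight of zero under this formulation, which is the intended effect of the IDF factor.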


2.2. Word embedding 

With the success of deep learning, several techniques based on neural networks have become 
increasingly popular to convert words into meaningful vectors. The so-called word embedding stands for a 
class of predictive models, which is commonly adopted to learn vector representations of words from a 
corpus of documents. It is designed to capture the context of words, semantic and syntactic similarity, and the 
relationship with other words. The underlying rationale is that the distance between vectors 
reflects the similarity in terms of meaning and semantics. 

Word2vec is a well-known word embedding technique, which was introduced by 
Mikolov et al. [13]. It has gained great popularity due to its good performance. It is a shallow neural 
network model designed for learning distributed representations of words, where each word is 
represented as a vector. These continuous representations of words are useful for a variety of real-world 
problems, including machine translation, information retrieval, and text classification. 

In particular, there are two variants of Word2Vec, called the skip-gram model and the continuous 
bag-of-words (CBoW) model: 
—  Skip-gram model [13], [14]: It is a neural network model composed of three layers: the input layer, 
the projection layer, and the output layer. It is designed to learn representations of words and to 
predict the context (i.e., the surrounding words of a target word). Figure 1 shows the architecture of the 
skip-gram model. 


Figure 1. Architecture of the skip-gram model [13] (input: w(t); projection layer; output: w(t-2), w(t-1), w(t+1), w(t+2)) 


Let k be a parameter representing the size of the context window on a single side. Also, let $w_1, w_2, 
\ldots, w_{T-1}, w_T$ be a sequence of words. For the target word $w_t$, the context window can be 
expressed by $[w_{t-k}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+k}]$. The target word $w_t$ is converted 
into a vector using one-hot encoding by placing 1 in the position of that word and 0 for the other words, 
as shown in Figure 1. There is only one hidden layer, which performs the dot product between the weight 
matrix and the input vector $w_t$ [15]. The result of the dot product at the hidden layer is passed to the 
output layer, which computes the dot product between the output vector of the hidden layer and the weight 
matrix of the output layer. The softmax activation function is used to infer the probability of the word 
$w_{t+j}$ appearing in the context given the target word $w_t$ as in (3), 

$$P(w_{t+j} \mid w_t) = \frac{\exp\left(v_{w_{t+j}}^{\top} u_{w_t}\right)}{\sum_{l=1}^{V} \exp\left(v_{l}^{\top} u_{w_t}\right)} \qquad (3)$$

V represents the number of words in the vocabulary. The vectors $u_w$ and $v_w$ denote the input and output 
vector representations of the word w. This model aims to maximize the average log probability, which 
is written as in (4) [16], 

$$\frac{1}{T} \sum_{t=1}^{T} \left[ \sum_{j=-k,\, j \neq 0}^{k} \log P(w_{t+j} \mid w_t) \right] \qquad (4)$$

where k is the size of the context window on a single side and T is the length of the word sequence. 

— Continuous bag of words (CBoW) model: It is a neural network architecture that is an alternative to the 
skip-gram model. It is designed to predict the target word in a sentence based on its context. CBoW is 
much faster than skip-gram and gives better accuracy for frequent words [13]. 

— Resource description framework to vector (RDF2Vec) [17]: It was introduced to learn embeddings from 
resource description framework (RDF) graphs after converting them into a set of sequences, because such 
algorithms require a propositional feature vector representation of the data, where each instance is 
represented by a vector of features that are binary, numerical, or nominal (symbols) [18]. 
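
To make the skip-gram description concrete, the sketch below generates (target, context) training pairs for a window of k words on each side and evaluates a softmax probability and an average log probability in the spirit of (3) and (4). It is a simplified illustration only: the function names and the two-dimensional vectors are hypothetical, `u` holds input vectors and `v` holds output vectors, and sequence boundaries are handled by simply skipping positions outside the sequence.

```python
import math


def skipgram_pairs(tokens, k):
    """Yield (target, context) pairs for a window of k words on each side."""
    for t, target in enumerate(tokens):
        for j in range(-k, k + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield target, tokens[t + j]


def context_probability(context, target, u, v):
    """Softmax probability of `context` given `target`, from input vectors u and output vectors v."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = {w: math.exp(dot(v[w], u[target])) for w in v}
    return scores[context] / sum(scores.values())


def average_log_probability(tokens, k, u, v):
    """Sum of log-probabilities over all pairs, averaged over the T positions."""
    total = sum(
        math.log(context_probability(c, t, u, v)) for t, c in skipgram_pairs(tokens, k)
    )
    return total / len(tokens)


# Hypothetical two-dimensional vectors for a three-word vocabulary.
u = {"computer": [1.0, 0.0], "virus": [0.5, 0.5], "outbreak": [0.0, 1.0]}
v = {"computer": [0.2, 0.1], "virus": [0.9, 0.3], "outbreak": [0.1, 0.8]}

sentence = ["computer", "virus", "outbreak"]
pairs = list(skipgram_pairs(sentence, k=1))
probs = {w: context_probability(w, "computer", u, v) for w in v}
objective = average_log_probability(sentence, 1, u, v)
```

Training a real model would adjust `u` and `v` to increase `objective`; that optimization step is deliberately left out of this sketch.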


3. RELATED WORKS 

Many query expansion approaches have been studied; some of them focus on local and 
vocabulary analysis. Schütze et al. [4] analyze word occurrences and relationships in the whole 
corpus to automatically derive a thesaurus. Such approaches are unable to handle ambiguous query terms 
because they process each query term separately from the others. As a result, the expansion of the query 
“apple computer” will cause query drift. Alternatively, grammatical relations can be used, e.g., entities that 
are grown, cooked, eaten, and digested are more likely to be food items. However, even if using grammatical 
dependencies is more accurate, using word co-occurrence is more robust because it cannot be misled by 
parser errors. 

Rocchio [5] refers to relevance feedback (RF) and pseudo-RF (PRF). RF is based on the 
user’s manual judgment of some of the retrieved documents and the use of this feedback information to 
expand the query. PRF considers only the top retrieved documents as relevant, which may decrease the 
quality of results, in particular for difficult queries, since the top retrieved documents may be irrelevant. In 
other words, the effectiveness of PRF depends directly on the quality of the feedback documents. Other related 
query expansion approaches used word embedding and linked open data knowledge bases, as summarized in Table 1. 


Table 1. Comparison between related work and our approach in terms of significant improvements 

| Related work | Limits | Significant improvements of our approach |
|---|---|---|
| Zamani and Croft [8] expanded queries using the relevance-based word embedding approach: relevance likelihood maximization (RLM). They trained word embedding models using queries as well as the top-ranked feedback documents of each query. | Even though useful information may be captured from the top retrieved documents [19]-[21], pseudo relevance feedback (PRF) may decrease retrieval performance, especially for difficult queries, unlike relevance feedback (RF) [22], which is very effective in improving retrieval performance [5], [21]. | Use of training data that are related to the used queries instead of using random queries for training. |
| Baroni et al. [23] expanded queries either by: (i) directly using the candidate terms that are closest to the query in the embedding space, computing the mean cosine similarity between a candidate term and all the query terms, or (ii) first restricting the search domain for the candidate expansion terms to the top-ranked documents instead of the whole collection of documents and then applying the previous approach. | No use of linked data. | Use of linked data in different stages of the process: both before and during the embedding. |
| Rattinger et al. [24] evaluated the quality of embeddings using the following metrics: (i) semantic relatedness based on human judgment, (ii) synonym detection, which uses cosine similarity to compare the target word with all the choices displayed, (iii) concept categorization, which groups concepts in taxonomic order, (iv) selectional preference, which uses noun-verb pairs to capture the relevancy of a noun. | Cosine similarity is used as an evaluation metric only at the end of the embedding. | Cosine similarity is used early in the process, before the embedding, to make sure that the training data are relevant to the query. |
| Imani et al. [25] trained the skip-gram model either on the whole corpus or on the English-language edition of Wikipedia. | Since the dataset used is small, the default number of iterations is changed from 5 to 20. | The top documents and the query are considered along with their interlinked data. |
| Kuzi et al. [26] suggested a deep expansion classifier (DEC) that used pre-trained word embeddings as inputs for the classifier, which classified candidate expansion terms into good terms for expansion, bad terms, and neutral ones. | No exploitation of feedback documents. | Exploitation of feedback documents. |
| Lavrenko et al. [18] converted linked open data graphs into a set of entities’ sequences using graph walks and Weisfeiler-Lehman subtree RDF graph kernels. Then they used those sequences to train a neural language model estimating the likelihood of an entities’ sequence appearing in a graph. | The relation between the query and the feedback documents is not exploited. | Exploitation of the relation between the query and the feedback documents through the use of linked data. |
| Alshargi et al. [10] evaluated the quality of concepts in the embeddings based on: (i) the categorization aspect, by considering the rdf: type (resource description framework) property, (ii) the relational aspect, by considering a relation as valid based on the entity’s types. | The enormous number of properties may make it difficult to judge the embedding based on only one property. | Cosine similarity is used early in the process, before the embedding, to make sure that the training data are relevant to the query. |
| Dahir et al. [27] used the query as a whole in CBoW to determine expansion terms from the entire list of feedback documents. Then, they integrated these terms with the pseudo-feedback-based relevance model (RM). | No use of external sources. | Use of DBpedia to obtain interlinked data. |
| Patel et al. [28] employed WordNet synonyms for the query title and DBpedia features for the query description field. | WordNet synonyms may lead to lower results if not picked carefully. Not all datasets have queries that have a query description field. | No use of WordNet. No use of any other field of the query. |
| Dahir et al. [9] proposed an approach designed to annotate the feedback documents of the initial query using DBpedia. Then, for each annotated DBpedia entity, the relevant “dct: subject” (an entity of type Thing) is adopted to find all entities having this subject as one of the “dct: subject” attribute’s values using the SPARQL protocol and RDF query language (SPARQL). | The SPARQL approach returns a long list of concepts and needs to be further exploited since it carries valuable information. | Use of the SPARQL approach results as training data for the skip-gram model. |



4. METHOD 


This section details our proposal, which is an improved variant of our previous method 
“cosine-similarity (COS-SIM) on linked vectors” [9]. In particular, we propose two approaches, called the 
label-based modified Concept2vec approach and the abstract-based modified Concept2vec approach. Each one is 
based on the skip-gram model illustrated in Figure 1 and a DBpedia feature (“dbo: abstract”) that was not 
exploited in the earlier work. Our modified Concept2vec approaches consist of the following steps, which are 
shown in Figures 2 and 3. 


Figure 2. Flowchart of our proposed label-based modified Concept2vec approach (stages: initial query and 
its top feedback documents’ terms; annotation; initial query’s DBpedia concepts; SPARQL query [9]; list of 
concepts with a subject in common, “dct: subject”; modified Concept2vec skip-gram model; similarity 
evaluation; reduced query keeping two expansion concepts) 

Figure 3. Flowchart of our proposed abstract-based modified Concept2vec approach (stages: reduced query 
from the label-based modified Concept2vec; DBpedia entities’ “dbo: abstract” values; modified Concept2vec 
skip-gram model; 10 most similar terms to each reduced query concept; valuable expansion concepts) 


Label-based modified Concept2vec approach: 
— Annotation of both the query’s expressions and the terms of the top associated feedback documents 
using DBpedia Spotlight; 
— Determination, for each annotated DBpedia concept, of a “dct: subject” that contains the concept’s 
label. If there is none, we use the first subject; 
— Use of the SPARQL protocol and RDF query language (SPARQL) to find a list of concepts that have a 
determined “dct: subject” in common. This is the SPARQL [9] approach; 
— Comparison, using cosine similarity, between the lists of concepts (previous step) that are related to 
the query concepts and those related to the top documents, in order to determine expansion concepts. 
This is the “COS-SIM on linked vectors” [9] approach; 
— Use of the SPARQL [9] lists of concepts as well as the “COS-SIM on linked vectors” [9] expansion 
concepts as data for the modified Concept2vec model; 
— Creation of the skip-gram model using (3) and (4). The context window covers 5 concepts before the 
input and 5 concepts after it; 
— Evaluation of the similarity between the initial query concepts and the “COS-SIM on linked vectors” 
expansion concepts from DBpedia using the following line in Python: 
Similarity('[QueryConcept]','[COS-SIMonlinkeddataConcept]'). 
We used (5) to calculate the similarity between the input vector A of the input word and the output 
vector B of the target word, and the softmax normalization [2] (6) to transform a set of given real 
values into the range [0,1], such that their sum is 1 [29]. 


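$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (5)$$

where $A_i$ and $B_i$ are the components of vectors A and B, respectively. 

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K \text{ and } z = (z_1, \ldots, z_K) \in \mathbb{R}^K \qquad (6)$$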
— Reduction of the “COS-SIM on linked vectors” expansion concepts to 2 based on the similarity results. 
In other words, only the 2 expansion concepts from the earlier work [9] that have the highest similarity 
with the initial query concepts are kept in the label-based modified Concept2vec approach, as shown in 
Figure 2. We opted for only 2 concepts because the number of expansion concepts in the “COS-SIM on 
linked vectors” [9] approach is already small. 
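
The similarity evaluation and the reduction to 2 expansion concepts can be sketched as follows. This is an illustrative standard-library version of the cosine similarity and softmax formulas given above, not the authors' code: the function names, the toy vectors, and the candidate concepts are hypothetical, and real vectors would come from the trained skip-gram model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(z):
    """Softmax: map real scores to values in [0, 1] that sum to 1."""
    shift = max(z)  # subtracting the maximum avoids overflow; the result is unchanged
    exps = [math.exp(x - shift) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def reduce_expansion(query_vec, candidates, keep=2):
    """Keep the `keep` candidate concepts most similar to the query concept."""
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(query_vec, candidates[c]),
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical embeddings for one query concept and three candidate concepts.
query = [1.0, 0.0]
candidates = {"malware": [0.9, 0.1], "software": [0.7, 0.7], "goat": [0.0, 1.0]}
kept = reduce_expansion(query, candidates)
normalized = softmax([1.0, 2.0, 3.0])
```

With several query concepts, one simple option (again an assumption) would be to rank each candidate by its highest similarity to any query concept before keeping the top 2.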

Abstract-based modified Concept2vec approach: 

— Annotation of the label-based modified Concept2vec expanded query using DBpedia; 

— Determination of the DBpedia entities within the query from the previous step; 

— Use of the found entities’ abstracts (i.e., the values of “dbo: abstract”) as the new data for the earlier 
skip-gram model (step 6); 

— Determination of the 10 most similar terms to each concept from the previously reduced query. Table 2 
shows an example of the terms returned for each DBpedia concept of the reduced query number 46, 
“tracking computer virus outbreaks”, in the Text Retrieval Conference Associated Press 88-90 
(TREC AP88-90) collection; 

— Use of the DBpedia entities within the terms found in the previous step as expansion concepts for the 
abstract-based modified Concept2vec query, as shown in Figure 3, and further expansion of the expansion 
entities using their associated “rdf: type”, “dct: subject”, and domain-dependent ontology attributes that 
have a maximum of 2 values. 
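
The step that retrieves the most similar terms for a query concept can be pictured as a ranking by cosine similarity over the learned vectors. The sketch below is a hedged illustration with hypothetical three-dimensional vectors and a hypothetical helper `most_similar`; the paper does not state which library function produced the values in Table 2.

```python
import math


def most_similar(term, vectors, topn=10):
    """Rank every other term by cosine similarity to `term`, highest first."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms

    scores = [
        (other, cos(vectors[term], vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical vectors standing in for embeddings learned from "dbo: abstract" text.
vectors = {
    "virus": [0.8, 0.6, 0.0],
    "malware": [0.7, 0.7, 0.1],
    "infect": [0.6, 0.3, 0.2],
    "goat": [0.0, 0.1, 1.0],
}
top = most_similar("virus", vectors, topn=2)
```

Only the returned terms that correspond to DBpedia entities would then be kept as expansion concepts, as described in the last step of the list.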

The label-based modified Concept2vec method is used as a refinement of the earlier method to 
minimize the number of expansion terms and keep only the most effective ones, whereas the abstract-based 
modified Concept2vec method further expands the new reduced query by employing the long values of 
“dbo: abstract” in skip-gram. In other words, the abstract-based modified Concept2vec method is a 
continuation of the label-based modified Concept2vec method. 
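
For readers unfamiliar with the SPARQL step reused from [9], the following sketch builds a query that lists the concepts sharing a given “dct: subject”. The exact query of [9] is not reproduced in this paper, so the query shape, the helper name `shared_subject_query`, the `limit` parameter, and the example category URI are all assumptions made for illustration.

```python
def shared_subject_query(subject_uri, limit=100):
    """Build a SPARQL query listing the concepts that have `subject_uri` as a dct:subject."""
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT DISTINCT ?concept WHERE {\n"
        f"  ?concept dct:subject <{subject_uri}> .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Example subject (a DBpedia category URI chosen for illustration only).
query = shared_subject_query("http://dbpedia.org/resource/Category:Computer_viruses")
print(query)
```

The resulting string could be sent to a SPARQL endpoint such as the public DBpedia one; the list of returned concepts is what the label-based approach uses as training data.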

In DBpedia, annotated concepts can change depending on lower and upper case. Consequently, the query 
terms “computer” and “virus” were annotated as two separate concepts. However, the most similar terms in 
Table 2 make it clear that DBpedia does take the context into account. Thus, for “virus” we have terms like 
“software” and “malware”, which are semantically related to the query. 


Table 2. Ten most similar words to query number 46, of TREC AP88-90, using our abstract-based modified 
Concept2vec approach 

| Query concept | Most similar terms (cosine similarity) |
|---|---|
| computer | monster (0.2483), trojan (0.2199), vector (0.1844), fk (0.1701), beast (0.1609), strang (0.1530), fester (0.1492), mac (0.1396), dunihi (0.1123), comparison (0.1108) |
| virus | multipartite (0.2167), zero (0.1925), goat (0.1874), comparison (0.1382), multigrainmalwar (0.1346), download (0.1117), infect (0.1107), cooki (0.1069), rabbit (0.0908), softwar (0.0877) |


5. RESULTS AND DISCUSSION 
5.1. Dataset and evaluation measures 

In this work, we used the TREC AP88-90 document collection [30], described in Table 3, and 10 of 
its associated queries. We opted for a sample of queries because we believe it is sufficient for the 
experiment. The length of these queries ranges from 1 to 6 keywords, as in the previously mentioned query 
number 46, “tracking computer virus outbreaks”. Also, we used TF-IDF as the retrieval model. 


Table 3. Description of the dataset 

| Number of documents | Average document size | Document relevancy | Topics (queries) numbers |
|---|---|---|---|
| 158,240 | 261 | 0 (non-relevant), 1 (relevant) | 1-50, 51-100, and 101-150 |


To evaluate the performance of our proposal, we employed three popular metrics: precision, 
recall, and mean average precision (MAP) [31]. They are widely used to measure retrieval quality: 
— Precision: it is defined as in (7), 

$$\text{Precision} = \frac{\text{Number of relevant retrieved documents}}{\text{Number of retrieved documents}} \qquad (7)$$

— Recall: it is known as the true positive rate. It shows the capability of the system to return all the 
relevant documents. It is expressed as in (8), 

$$\text{Recall} = \frac{\text{Number of relevant retrieved documents}}{\text{Number of relevant documents}} \qquad (8)$$

— Mean average precision (MAP): for a set of queries, MAP represents the mean of the average precision 
(AP) scores of the queries. It is represented as in (9), 

$$\text{MAP} = \frac{\sum_{q=1}^{Q} \text{AveP}(q)}{Q} \quad \text{and} \quad \text{AveP} = \frac{\sum_{k=1}^{n} \left( P(k) \times \text{rel}(k) \right)}{\text{Number of relevant documents}} \qquad (9)$$

where Q is the number of queries, n is the number of retrieved documents, P(k) is the precision at cut-off 
rank k, and rel(k) is an indicator function that is equal to 1 if the element at rank k is a relevant document 
and zero otherwise. 
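
The three measures can be computed with a few lines of Python. The sketch below follows the definitions of precision, recall, and MAP given above on hypothetical rankings and relevance judgments; the function names and the toy data are assumptions, and a standard tool would normally be used for official TREC evaluation.

```python
def precision_at(ranking, relevant, k):
    """Precision restricted to the top-k retrieved documents."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def recall_at(ranking, relevant, k):
    """Recall restricted to the top-k retrieved documents."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / len(relevant)


def average_precision(ranking, relevant):
    """AveP: sum of P(k) * rel(k) over ranks, divided by the number of relevant documents."""
    total = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:  # rel(k) = 1
            total += precision_at(ranking, relevant, k)
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP: mean of AveP over all (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings and relevance judgments for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6", "d7"], {"d6"}),
]
map_score = mean_average_precision(runs)
```

Truncating each ranking to its first 10 documents before calling `mean_average_precision` is one way to obtain a MAP@10-style score; whether this matches the exact variant used in the experiments is an assumption.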


5.2. Results and discussion 

This section compares the performance of the two proposed models, namely the label-based 
modified Concept2vec approach and the abstract-based modified Concept2vec approach, with existing competitors. 
Figures 4-7 show the results obtained using our proposed query expansion approaches, based on modified 
Concept2vec and the RDF knowledge graph. Figure 6 reports the MAP of the methods over the top retrieved 
documents (MAP@10), since users mainly check the results on the first page (the top ones), whereas Figure 7 
reports the MAP over the whole set of retrieved documents and not only the top ones. 


Figure 4. Precision at different cut-off ranks, for our two approaches and the baseline method 

Figure 5. Recall at different cut-off ranks, for our two approaches and the baseline method 


Query expansion based on modified Concept2vec model using ... (Sarah Dahir) 




ISSN: 2252-8938 


Figure 6. MAP@10 for our two approaches and the baseline method (bar chart by method: COS-SIM on linked 
vectors 0.380, label-based modified Concept2vec 0.389, abstract-based modified Concept2vec 0.395) 

Figure 7. Results of the comparison of the proposed approaches with related works 


From Figures 4-6, we observe significant improvements in precision at all cut-off ranks for 
both approaches compared to the baseline. Furthermore, the abstract-based modified Concept2vec slightly 
outperformed the label-based modified Concept2vec for P@5 and P@10 only, whereas the label-based modified 
Concept2vec outperformed the abstract-based modified Concept2vec in terms of P@15, P@20, and P@30. In 
addition, we achieved significant improvements at all recall levels, in particular for the abstract-based 
modified Concept2vec. For instance, R@10 improved from 0.140 for the baseline to 0.156 for the label-based 
modified Concept2vec and 0.267 for the abstract-based modified Concept2vec. Moreover, MAP@10 
increased from 0.380 for the baseline to 0.389 for our label-based modified Concept2vec approach and 0.395 
for the abstract-based modified Concept2vec approach. 

From Figure 7, it is clear that our approaches outperform DEC [26]. Although our expansion 
candidates are taken from an external source, they are semantically linked to the feedback documents, whereas 
in Kuzi et al. [26] no initial retrieval is performed. Consequently, good (if not better) terms from the feedback 
documents are not exploited in Kuzi et al. [26]. Another reason why our approaches perform better than DEC is 
our use of query-specific, dedicated training data. In other words, we used different training data for every 
query, which is not the case for DEC, where a general and non-domain-specific corpus is used. 

Similarly, RLM [8] leads to lower results because the initial retrieval is performed on publicly 
available queries and not on AP queries. Also, no semantic source is used, and no dedicated training datasets 
are chosen depending on the query. The approach still performs slightly better than DEC since it uses a 
considerably large number of queries for training. As for RM-Cent [27], the lower results are due to expansion 
using only the initial retrieval results, i.e., external sources that may contain good interlinked terms are not 
used. However, it is clear that including the initial retrieval corpus in expansion methods in general, e.g., 
COS-SIM on linked vectors [9], and in Word2Vec methods in particular, e.g., RM-Cent [27], gives better 
results than using a random and general training corpus, e.g., DEC [26] and RLM [8]. 

In the previous work [9], the use of the “SPARQL” method to expand queries did not give good 
results. We believed that this was due to the high number of concepts that may be used in that case for 
expanding a query. In the label-based modified Concept2vec, we tried to benefit from the earlier work’s 
“SPARQL” method [9] by using each of its expanded queries separately as data for our modified 
Concept2vec method. We believe that the results improved because each “SPARQL” expanded query used is 
query-dependent, unlike the available training data used in other related work, which tend to be general and 
not specific to a particular query. However, the “SPARQL” [9] queries may be short in some cases, which may 
not be beneficial for our method, which requires large training data. As for the abstract-based modified 
Concept2vec method, it further improved the results because it employs the interlinked data of the already 
improved queries. Furthermore, we believe that our choice of the attribute to use as data for skip-gram was 
successful since “dbo: abstract” is one of the DBpedia attributes that are always available no matter what the 
domain of the entity is. Also, “dbo: abstract” is always mono-valued; consequently, we do not have 
multiple values to choose from. Moreover, this attribute is often long enough to be used as data for neural 
networks. In the future, we suggest varying the linked data attributes for the “SPARQL” method [9] to 
increase the size of the data. 


6. CONCLUSION 

This work is designed to improve the query expansion method “COS-SIM on linked vectors” using 
two modified Concept2vec methods based on the skip-gram model. One advantage of our work is the use of 
training data that are semantically related and relevant to the query. To the best of our knowledge, unlike 
existing works that rely on general training data for Word2vec, such as Google News articles, our approaches 
significantly improved on the results of related studies. On the one hand, our label-based modified Concept2vec 
approach, which uses long vectors from the SPARQL approach as training data, helped to improve the results of 
“COS-SIM on linked vectors” through query reduction. This approach restricts the number of expansion 
concepts to 2 instead of keeping all concepts with a cosine similarity higher than 0 as in the earlier work. On 
the other hand, our abstract-based modified Concept2vec approach further improves the reduced query using 
DBpedia’s “dbo: abstract” values of the entities within the new query as the new training data. We also 
consider that the quality of word embeddings should benefit from bigger datasets. However, one limitation of 
our approach is the size of the data, which varies depending on the size of the vectors generated by the SPARQL 
query in the previous work and on the size of the “dbo: abstract” values. One way to alleviate this problem is 
to vary the linked data attributes used for the SPARQL query to increase the size of the data. Another way to 
improve our approach is to vary the values of the parameters ‘size’ and ‘window’ and observe the variations 
in the results. These two perspectives represent the main topics of our current research. 